Abstract: Many studies on cost-sensitive learning assume
that a unique cost matrix is known for a problem. However,
this assumption may not hold for many real-world problems.
For example, a classifier might need to be applied in several
circumstances, each associated with a different cost
matrix. Or, different human experts may have different opinions about
the costs for a given problem. Motivated by these facts, this study
seeks the minimax classifier over multiple cost matrices.
We prove theoretically that, no matter how many
cost matrices are involved, the minimax problem can be tackled
by solving a number of standard cost-sensitive problems and
sub-problems that involve only two cost matrices. As a result, a
general framework for obtaining a minimax classifier over multiple
cost matrices is suggested and justified by preliminary empirical
studies.



I. Introduction 

In many real-world classification problems, different types
of misclassification commonly result in different costs. For
example, in fraud detection, predicting a normal client
as fraudulent will cut the profit, while predicting a fraudulent client as
normal would usually lead to great loss [1]. In these scenarios,
it is more desirable to minimize the total cost rather
than the classification error. This kind of problem is referred to
as the cost-sensitive learning problem [2], and has attracted much
interest in recent years due to its wide applications in the real
world [3], [4], [5].

So far, the majority of previous research on cost-sensitive
learning assumes that the costs for different types of
misclassification, typically represented as a cost matrix, are
uniquely specified before the classifier is applied to new data.
Specifically, if the cost matrix is known before the training
procedure, it can be integrated into the learning algorithm
to obtain a classifier with minimum total cost. This can be
done by modifying the training data according to the cost
matrix [6], [7], or by extending learning algorithms directly
[8], [9]. In addition to specialized methods, some alternative
approaches, which are motivated by other learning problems,
can also be employed to address cost-sensitive learning
problems. This category of methods, including calibration
methods [10], threshold moving [5] and its variants [11],
typically post-processes the output of a classifier to optimize
its performance with respect to an objective (e.g., minimizing
the total cost or classification error). In this sense, it is not
necessary to know the cost matrix prior to the training phase,
as long as it becomes available before testing.¹
1 Sometimes, post-processing can also be considered as a part of the training
procedure. From this point of view, the cost matrix still needs to be specified
before the training phase is finished.



All the above-mentioned approaches assume that a unique
cost matrix is known for a given cost-sensitive problem.
Unfortunately, in the real world, it can be very difficult
for a practitioner to specify the cost matrix uniquely, either
because one may not have a good sense of the exact
values of misclassification costs, or because the costs may vary
under different circumstances and are thus uncertain in nature.
In short, the cost matrix for a real-world problem may be
uncertain throughout both training and testing.

As a matter of fact, the difficulty of specifying a cost matrix
has been acknowledged by many researchers. In the context
of ROC analysis [12], it is claimed that a classifier can be
built without any cost information and still perform well in
scenarios where the cost matrix changes. Nevertheless, an
underlying assumption behind this statement is that threshold
moving (or a similar method) is employed to fine-tune
the output of the classifier. Hence, as discussed above,
a specified cost matrix is still required in the post-processing
phase. Zadrozny and Elkan [13] considered the scenario where
example-based misclassification costs are static but unknown.
More recently, Liu and Zhou [14] investigated the problem of
learning with cost intervals. Specifically, the misclassification
cost is assumed to take a value within a predefined interval,
and an approach is developed to train an SVM that performs
well for every possible value of the cost.

Rather than striving to achieve satisfactory performance
over all possible cost matrices, the aim of this work is to
minimize the largest total cost over a finite set of possible
cost matrices, i.e., to find the minimax classifier. Under mild
assumptions, we prove that the minimax classifier over multiple
cost matrices can be obtained by solving a set of standard
cost-sensitive learning problems and a set of sub-problems
that involve only two cost matrices. This finding immediately
suggests a general framework for seeking a minimax classifier
over an arbitrary number of cost matrices. Moreover, since an
interval can be transformed into a finite set of values via
discretization, the framework is also applicable to scenarios
where only the largest and smallest misclassification costs
are available.

The rest of this paper is organized as follows. Preliminary
background and related work are introduced in more
detail in Section II. Section III presents the theoretical analysis
of the minimax problem. Experimental studies are reported in
Section IV, and we conclude the paper in Section V.

II. Preliminaries and Related Works

In this section, we first introduce the basic notation and
background, and then review two works that are closely
related to this study. One is the work of Liu and Zhou [14],
which also deals with the uncertain cost problem, but with a
different formulation and learning target; the other focuses
on finding the minimax classifier for uncertain class priors [15].

A. Preliminaries 

Given a dataset $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$,
$x_i = \{x^1, \ldots, x^m\} \in \mathbb{R}^m$ is the feature vector of instance $(x_i, y_i)$
and $y_i \in \{0, 1\}$ is the class label. Suppose that correct
classification incurs no cost; then a cost matrix $C$ can be represented
by two values $c_0$ and $c_1$, denoting the cost of misclassifying
an instance from class 0 and class 1 respectively. Also, we
use $p_0$ and $p_1 = 1 - p_0$ to represent the class priors, so there
are $n_0 = n p_0$ and $n_1 = n p_1$ instances in each class. For any
classifier $h$ from the hypothesis space $\mathcal{H}$, its total cost is

$$L = n p_0 p_{10} c_0 + n p_1 p_{01} c_1 = n_0 p_{10} c_0 + n_1 p_{01} c_1, \qquad (1)$$

where $p_{10}$ ($p_{01}$) is the probability that $h$ misclassifies instances
from class 0 (1) to class 1 (0).
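
As a concrete reading of the total cost defined above, the following minimal Python sketch computes it from true and predicted labels. The function names `error_rates` and `total_cost` are illustrative assumptions, not part of the original study, and labels are assumed to be 0 or 1.

```python
def error_rates(y_true, y_pred):
    """Return (p10, p01): the fraction of class-0 instances predicted as 1
    and the fraction of class-1 instances predicted as 0."""
    n0 = sum(1 for t in y_true if t == 0)
    n1 = len(y_true) - n0
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    p10 = fp / n0 if n0 else 0.0
    p01 = fn / n1 if n1 else 0.0
    return p10, p01


def total_cost(y_true, y_pred, c0, c1):
    """Total cost L = n0 * p10 * c0 + n1 * p01 * c1."""
    n0 = sum(1 for t in y_true if t == 0)
    n1 = len(y_true) - n0
    p10, p01 = error_rates(y_true, y_pred)
    return n0 * p10 * c0 + n1 * p01 * c1
```

With three class-0 and two class-1 instances, one error of each type, $c_0 = 2$ and $c_1 = 5$, the total cost is $1 \cdot 2 + 1 \cdot 5 = 7$.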

B. Learning with Cost Intervals 

In a recent work, Liu and Zhou [14] considered a special
form of the uncertain cost problem where $c_0$ is 1, and $c_1$
is uncertain but within a predefined interval $[c_{\min}, c_{\max}]$.
Their objective is to construct a classifier that performs well
for every individual cost within $[c_{\min}, c_{\max}]$. Technically, the
problem was transformed into finding the best surrogate cost $c_s$
to train with, i.e., their learning target is

$$\min_{h \in \mathcal{H}} L(h, S, c_s) \quad \text{s.t.} \quad p\big(L(h, S, c) < \epsilon\big) > 1 - \delta, \; \forall c \in [c_{\min}, c_{\max}]. \qquad (2)$$

An SVM-based algorithm was proposed there, which primarily
minimizes the largest total cost (i.e., $L(h, S, c_{\max})$) and
secondarily minimizes the total cost at the mean cost
$c_\mu = (c_{\max} + c_{\min})/2$, i.e., $L(h, S, c_\mu)$. Solid experimental results
reported there confirmed the efficacy of the method.

However, to fit the interval formulation, one needs to
artificially re-scale the original cost matrices by different factors to
ensure that every $c_0$ is 1. Although this re-scaling does no
harm to traditional cost-sensitive learning or to the study
in [14], it makes the comparison of total costs across different
cost matrices meaningless. Considering that it is generally hard
or even impossible to find a classifier that performs well
on all costs over the interval (as suggested by [14] itself), the
best classifier built in this way may lead to a very large total cost on
the original cost matrices of real-world problems.

C. Minimax Classifier for Uncertain Class Priors 

Among the many studies involving the minimax criterion [16],
those that focus on building a minimax total cost
classifier for uncertain class priors [15] are of particular interest
to this study.

Formally, in the case of uncertain class priors, the minimax
classification problem is to find the following classifier:

$$h_P = \arg\min_{h \in \mathcal{H}} \max_{P} L(h, P, C). \qquad (3)$$

It is well known that the total cost of a fixed classifier is a
linear function of the prior, while the optimal total cost (i.e., the
Bayesian cost) is a concave function of the prior [17]. Therefore,
if the best classifier for a given class prior $P^*$ is $h^*$,
then the total cost function of $h^*$ w.r.t. the prior is a
tangent line of the Bayesian total cost curve at $P^*$. Based
on these elegant properties, Alaiz-Rodriguez et al. proposed
two neural-network-based algorithms that find the
minimax classifier iteratively [15]. Readers interested in the
details of the algorithms are referred to that paper.

Noticing the deceptively symmetrical positions of prior and
cost in Eq. (1), one may think that all the analysis and
algorithms for the uncertain prior problem can be employed
directly for the uncertain cost problem considered in this study.
Unfortunately, that is not the case. Because both $c_0$
and $c_1$ are free variables (i.e., the sum-to-one property of the prior
does not apply to cost), the concavity of the Bayesian total cost
w.r.t. the prior cannot be transferred to cost. In the following,
we approach the minimax problem for uncertain cost in a
different way.

III. Minimax Classifier for Uncertain Cost 

A. Problem Formulation 

As mentioned above, this study focuses on minimizing the
largest total cost over a finite set of possible cost matrices.
Formally, given a set of cost matrices $U = \{C_1, \ldots, C_k\}$,
where $C_i = \{c_0^i, c_1^i\}$ is the $i$-th cost matrix, the learning target
is to find

$$h_U = \arg\min_{h \in \mathcal{H}} \max_{C \in U} L(h, S, C). \qquad (4)$$

Since the uncertain cost is formulated directly as a set, the
problem is widely applicable in practice, ready for future study
on multi-class problems, and amenable to theoretical analysis.
On the other hand, the best classifier selected by the minimax
criterion is more reliable, in the sense that its worst-case total
cost over $U$ is as small as possible.
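
To make the minimax target in Eq. (4) concrete, here is a minimal Python sketch that picks the minimax classifier from a finite pool of candidates. It assumes, purely for illustration, that each candidate is summarized by its error-rate pair `(p10, p01)` and each cost matrix by a pair `(c0, c1)`; the function names are hypothetical.

```python
def cost(point, n0, n1, cost_matrix):
    """Total cost of a classifier summarized by point = (p10, p01)."""
    p10, p01 = point
    c0, c1 = cost_matrix
    return n0 * p10 * c0 + n1 * p01 * c1


def worst_case_cost(point, n0, n1, cost_matrices):
    """Largest total cost of one classifier over all cost matrices."""
    return max(cost(point, n0, n1, cm) for cm in cost_matrices)


def minimax_point(points, n0, n1, cost_matrices):
    """Candidate whose worst-case total cost is smallest."""
    return min(points, key=lambda pt: worst_case_cost(pt, n0, n1, cost_matrices))
```

For example, with candidates `(0.0, 0.5)`, `(0.2, 0.2)`, `(0.5, 0.0)`, 100 instances per class, and cost matrices `(1, 4)` and `(4, 1)`, the balanced candidate `(0.2, 0.2)` has worst-case cost 100 while the other two reach 200, so it is the minimax choice.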

B. Problem Analysis 

For two different cost matrices $C_i$ and $C_j$ in $U$, if both
$c_0^i < c_0^j$ and $c_1^i < c_1^j$, then the total cost of any classifier $h$
obtained on $C_i$ will be smaller than that on $C_j$. In this case,
we say that $C_i$ is dominated by $C_j$. Furthermore, if there exists
a cost matrix $C_d$ that dominates all others in $U$, the above
minimax problem reduces to a standard cost-sensitive
learning problem with the fixed cost matrix $C_d$. Therefore, given a
minimax classification problem over a set of cost matrices, the
first step is to check for and delete the cost matrices
that are dominated by any other cost matrix in $U$.
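
The pruning step just described can be sketched in a few lines of Python. This is only an illustration under the assumption that a cost matrix is stored as a pair `(c0, c1)`; the helper names are hypothetical.

```python
def is_dominated(ci, cj):
    """True if cost matrix ci = (c0, c1) is dominated by cj,
    i.e. both of its entries are strictly smaller."""
    return ci[0] < cj[0] and ci[1] < cj[1]


def remove_dominated(cost_matrices):
    """Keep only the cost matrices that no other matrix dominates."""
    return [ci for ci in cost_matrices
            if not any(is_dominated(ci, cj) for cj in cost_matrices)]
```

If a single matrix survives the pruning, the problem is a standard cost-sensitive learning problem with that matrix.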

On the other hand, the performance of a classifier $h$ from the
hypothesis space $\mathcal{H}$ can be mapped to a point in the 2-D space
with $p_{10}$ as the x-axis and $p_{01}$ as the y-axis. Similarly, for two
different classifiers $h_a$ and $h_b$, if $p_{10}^{h_a} \le p_{10}^{h_b}$ and $p_{01}^{h_a} \le p_{01}^{h_b}$,
then $L_{h_a} \le L_{h_b}$, no matter what the cost matrix is. In this case,
we say that $h_a$ dominates $h_b$. If a classifier is not dominated
by any other classifier in $\mathcal{H}$, it is a non-dominated classifier.
Following the concept in economics [18], the front formed
by all non-dominated classifiers in $\mathcal{H}$ is called the Pareto
front (see Fig. 1). When $\mathcal{H}$ is an infinite hypothesis space and
the dataset $S$ consists of enough samples, the front is continuous.
Obviously, for both the standard cost-sensitive learning problem
and the minimax problem considered in this study, the optimal
classifiers must be on the Pareto front.



Fig. 1. Mapping classifiers to the 2-D space with $p_{10}$ as the x-axis and $p_{01}$
as the y-axis. $h_a$ dominates $h_b$; $h_1$ and $h_2$ are non-dominated classifiers, hence
on the Pareto front.
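
A minimal Python sketch of extracting the Pareto front from a finite pool of classifiers follows. It assumes each classifier is represented only by its `(p10, p01)` point; the function names are illustrative and do not come from the original study.

```python
def dominates(a, b):
    """a and b are (p10, p01) points; a dominates b if it is no worse
    on both error rates and the two points differ."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def pareto_front(points):
    """Non-dominated points, sorted by increasing p10."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front))
```

Sorting by `p10` orders the front from left to right, so `p01` decreases along the returned list.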

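Let us first consider the situation with only one cost matrix
$C$. Lemma 1 reveals the relative order between the total costs
of any two classifiers on the front.

Lemma 1. For any two classifiers $h_1$, $h_2$ on the Pareto front with $p_{10}^{h_1} < p_{10}^{h_2}$,

$$L_{h_1} > L_{h_2} \iff \frac{p_{01}^{h_1} - p_{01}^{h_2}}{p_{10}^{h_2} - p_{10}^{h_1}} > \frac{n_0 c_0}{n_1 c_1},$$

and likewise with $>$ replaced by $=$ or by $<$ on both sides.

Proof: According to Eq. (1),

$$L_{h_1} = n_0 c_0 p_{10}^{h_1} + n_1 c_1 p_{01}^{h_1}, \qquad L_{h_2} = n_0 c_0 p_{10}^{h_2} + n_1 c_1 p_{01}^{h_2}.$$

Therefore,

$$L_{h_1} > L_{h_2} \iff n_0 c_0 p_{10}^{h_1} + n_1 c_1 p_{01}^{h_1} > n_0 c_0 p_{10}^{h_2} + n_1 c_1 p_{01}^{h_2} \iff \frac{p_{01}^{h_1} - p_{01}^{h_2}}{p_{10}^{h_2} - p_{10}^{h_1}} > \frac{n_0 c_0}{n_1 c_1}.$$

Similarly, the other two cases can be verified. ■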
Notice that the left-hand side of each case in Lemma 1,
$(p_{01}^{h_1} - p_{01}^{h_2})/(p_{10}^{h_2} - p_{10}^{h_1})$, is the absolute value of the slope
of the segment connecting $h_1$ and $h_2$, which is determined
by the classifiers' performance, while $(n_0 c_0)/(n_1 c_1)$, on the other
hand, is a constant given dataset $S$ and cost matrix $C$. That
is, geometrically, the relative order between the total costs of
a pair of classifiers on the front is determined by the slope of the
segment connecting these two classifiers.
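
The geometric reading above can be checked numerically with a small Python sketch. It assumes two front classifiers given as `(p10, p01)` points with the first having the smaller `p10`; the function name and the returned labels are hypothetical.

```python
def cheaper_by_slope(h1, h2, n0, n1, c0, c1):
    """Decide which of two Pareto-front points has the lower total cost
    by comparing the segment's absolute slope with n0*c0 / (n1*c1).
    Requires h1[0] < h2[0]."""
    slope = (h1[1] - h2[1]) / (h2[0] - h1[0])
    threshold = (n0 * c0) / (n1 * c1)
    if slope > threshold:
        return "h2"   # steep segment: moving right saves more than it costs
    if slope < threshold:
        return "h1"
    return "tie"
```

For `h1 = (0.1, 0.6)` and `h2 = (0.2, 0.3)` with 100 instances per class, the slope is about 3. With $c_0 = c_1 = 1$ the threshold is 1, so `h2` is cheaper (total costs 50 versus 70); with $c_0 = 5$, $c_1 = 1$ the threshold is 5, so `h1` is cheaper (110 versus 130).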

Furthermore, since all the dominated classifiers can be
ignored for our problem, the total cost in Eq. (1) can be treated as a
function of the classifiers on the Pareto front. For brevity, we
further consider it as a function of $p_{10}$, keeping in mind that
$p_{01}$ is determined correspondingly. Hereafter, we denote the
total cost function as $L_C(p_{10})$ for cost matrix $C$. The following
lemma describes the behavior of $L_C(p_{10})$ along the front.

Lemma 2. Assume the Pareto front is convex.² Then the total
cost function $L_C(p_{10})$ first decreases monotonically to its minimum,
and then increases monotonically over the front.

Proof: Given any three adjacent classifiers on the front,
$h_1$, $h_2$, $h_3$, without loss of generality, we suppose
$p_{10}^{h_1} < p_{10}^{h_2} < p_{10}^{h_3}$. Since the curve of the Pareto front is decreasing and
convex, $h_3$ must lie on the right side of the line passing through $h_1$ and
$h_2$. Together with the fact that $p_{10}^{h_1} < p_{10}^{h_2} < p_{10}^{h_3}$ and
$p_{01}^{h_1} > p_{01}^{h_2} > p_{01}^{h_3}$, we have

$$\frac{p_{01}^{h_1} - p_{01}^{h_2}}{p_{10}^{h_2} - p_{10}^{h_1}} \ge \frac{p_{01}^{h_2} - p_{01}^{h_3}}{p_{10}^{h_3} - p_{10}^{h_2}}.$$

Since $h_1, h_2, h_3$ are three arbitrary adjacent classifiers on the
front, it follows that the absolute value of the slope between adjacent
classifiers is monotonically non-increasing along the front.
Suppose the classifier of minimal total cost for cost matrix
$C$ is $h_C$, with the total cost $L_C(p_{10} = p_{10}^{h_C})$. Then, according
to Lemma 1, $L_C(p_{10})$ first decreases monotonically to
$(p_{10}^{h_C}, L_C(p_{10}^{h_C}))$, and then increases monotonically. ■
In fact, Lemma 2 describes the behavior of the total cost
function for the standard cost-sensitive learning problem (i.e.,
with only one cost matrix), and many cost-sensitive learning
methods published in the literature can, in principle, be used
to find the minimum point. See Fig. 2 for an illustration.³

Fig. 2. The total cost curve vs. classifiers on the Pareto front. From the
perspective of $p_{10}$, it also decreases at first, and then increases.
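
The decrease-then-increase behavior stated in Lemma 2 can be illustrated on a small convex front. The Python sketch below is only a numerical illustration under the assumption that the front is given as `(p10, p01)` points sorted by increasing `p10`; the helper names are hypothetical.

```python
def cost_curve(front, n0, n1, c0, c1):
    """Total cost of each front classifier, in order of increasing p10."""
    return [n0 * p10 * c0 + n1 * p01 * c1 for p10, p01 in front]


def is_unimodal(values):
    """True if the sequence is non-increasing up to some index and
    non-decreasing from there on."""
    i = 0
    while i + 1 < len(values) and values[i + 1] <= values[i]:
        i += 1
    while i + 1 < len(values) and values[i + 1] >= values[i]:
        i += 1
    return i == len(values) - 1
```

For the convex front `(0.0, 0.8), (0.1, 0.5), (0.3, 0.2), (0.6, 0.05), (1.0, 0.0)` with 100 instances per class and $c_0 = c_1 = 1$, the costs are approximately 80, 60, 50, 65, 100, which fall to a single minimum at the third classifier and rise afterwards.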

We are now ready to consider the situation with multiple
cost matrices. For a set of $k$ cost matrices, there are $k$ corresponding
total cost curves. Each of them decreases to its own
minimum at first and then increases. Fig. 3 shows an example
consisting of two cost matrices. We can see that, in this case,
the minimax total cost is located at the crossing point of these two
curves. Generally, the position of the minimax classifier for
multiple cost matrices is confined by the following theorem.

2 Analogous to the ROCCH technique, if the Pareto front is not
convex, one can construct the convex hull of all non-dominated classifiers as
the surrogate Pareto front. Please refer to [12], particularly Theorem 7 there,
for further details.

3 Note that the total cost curve is drawn for illustration purposes; the convexity
it appears to have is neither implied nor proved.

Fig. 3. The total cost curves for $C_1$ and $C_2$. Each of them decreases at first,
and then increases monotonically. $h_1$ and $h_2$ are the best classifiers for $C_1$
and $C_2$, and $h_c$ is the minimax classifier for these two cost matrices.

Theorem 1. For $k$ different total cost curves, each of which
decreases to its own minimum at first and then increases
monotonically, the minimax total cost is located at one of
two types of positions:

1) the minimum point of an individual curve,

2) a point where curves cross.

Proof: Suppose the minimax total cost is located at neither
of the two types of positions. Without loss of generality,
we assume it is on the total cost curve of $C_i$ (i.e., $L_{C_i}$). Noting
that the minimax classifier is $h_U$, we have

$$L_{C_i}(p_{10}^{h_U}) > L_{C_j}(p_{10}^{h_U}) \quad \text{for } j \neq i.$$

There is no equality because the minimax total cost is
not obtained at a type-2 point. On the other hand, since
the minimax total cost is not obtained at a type-1 point either,
according to Lemma 2 we can find another classifier $h'_U$ such
that

$$L_{C_i}(p_{10}^{h_U}) > L_{C_i}(p_{10}^{h'_U}) > L_{C_j}(p_{10}^{h'_U}) \quad \text{for } j \neq i.$$

This means the minimax total cost can be reduced, which
contradicts the definition of minimax. Therefore, the theorem
holds. ■
So, in order to find the minimax classifier, we just need
to examine every classifier that corresponds to one of the two types of
positions. However, without further information, any pair of
total cost curves may cross each other several times in practice,
hence it would be very expensive or even impossible to
examine all these points without omission. Fortunately, this
obstacle can be removed by the following corollary.

Corollary 1. For a set of $k$ cost matrices $U = \{C_1, \ldots, C_k\}$,
the minimax total cost classifier $h_U$ belongs to one of the following
two categories:

1) classifiers that minimize the total cost for an individual
cost matrix,

2) classifiers that minimize the maximum total cost for a pair of cost
matrices.



Proof: According to Theorem 1, if the minimax total cost
is obtained at one of the type-1 positions, then the minimax
classifier falls into the first category, and the corollary is true.
Otherwise, the minimax total cost is obtained at a crossing point
of total cost curves. We know that there are at least two total
cost curves with different monotonicity at the crossing
point; otherwise, we could move $h_U$ in the direction in which all
involved curves are decreasing, leading to a reduced minimax
total cost. Let $L_{C_i}$ be decreasing and $L_{C_j}$ be increasing
at $h_U$; then we know from Lemma 2 that the maximal total
cost for $(C_i, C_j)$ is larger for all other classifiers. Hence,
the crossing point is also the minimax total cost for $(C_i, C_j)$.
So, the corollary is also true in this case. ■
According to Corollary 1, the minimax classification problem
over multiple cost matrices is reduced to solving a set
of standard cost-sensitive learning problems and a set of
sub-problems that involve only two cost matrices, which avoids the need
to consider the tradeoff among multiple cost matrices. Finally,
the framework for solving the minimax classification problem
over a set of cost matrices is summarized in Algorithm 1.

Algorithm 1 Framework for solving the minimax classification
problem over a set of cost matrices

Input: dataset $S$, a set of cost matrices $U = \{C_1, \ldots, C_k\}$

delete all dominated cost matrices in $U$
if $U = \{C_1\}$ then
    $h_U = \arg\min_{h \in \mathcal{H}} L(h, S, C_1)$
else if $U = \{C_1, C_2\}$ then
    $h_U = \arg\min_{h \in \mathcal{H}} \max_{C \in \{C_1, C_2\}} L(h, S, C)$
else
    $V = \emptyset$
    for $i = 1$ to $|U|$ do
        find $h_{C_i} = \arg\min_{h \in \mathcal{H}} L(h, S, C_i)$
        $V = V \cup \{h_{C_i}\}$
    end for
    for $i = 1$ to $|U|$ do
        for $j = i + 1$ to $|U|$ do
            find $h_{ij} = \arg\min_{h \in \mathcal{H}} \max_{C \in \{C_i, C_j\}} L(h, S, C)$
            $V = V \cup \{h_{ij}\}$
        end for
    end for
    $h_U = \arg\min_{h \in V} \max_{C \in U} L(h, S, C)$
end if

Return: the minimax classifier $h_U$
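
The control flow of Algorithm 1 can be sketched in Python as follows. This is not the implementation used in the experiments: it assumes, for illustration only, that the hypothesis space is a finite list `H` of `(p10, p01)` points and that each single-matrix or pairwise sub-problem is solved by exhaustive search over `H`. All function names are hypothetical.

```python
from itertools import combinations


def cost(h, n0, n1, cm):
    """Total cost of classifier h = (p10, p01) under cost matrix cm = (c0, c1)."""
    return n0 * h[0] * cm[0] + n1 * h[1] * cm[1]


def argmin_single(H, n0, n1, cm):
    """Standard cost-sensitive sub-problem: one cost matrix."""
    return min(H, key=lambda h: cost(h, n0, n1, cm))


def argmin_pair(H, n0, n1, ci, cj):
    """Pairwise sub-problem: minimize the larger of two total costs."""
    return min(H, key=lambda h: max(cost(h, n0, n1, ci), cost(h, n0, n1, cj)))


def minimax_framework(H, n0, n1, U):
    # Step 1: delete cost matrices dominated by another one in U.
    U = [ci for ci in U
         if not any(ci[0] < cj[0] and ci[1] < cj[1] for cj in U)]
    if len(U) == 1:
        return argmin_single(H, n0, n1, U[0])
    if len(U) == 2:
        return argmin_pair(H, n0, n1, U[0], U[1])
    # Step 2: collect candidates from single and pairwise sub-problems.
    V = [argmin_single(H, n0, n1, c) for c in U]
    V += [argmin_pair(H, n0, n1, ci, cj) for ci, cj in combinations(U, 2)]
    # Step 3: pick the candidate with the smallest worst-case cost over U.
    return min(V, key=lambda h: max(cost(h, n0, n1, c) for c in U))
```

On a finite pool, exhaustive search over `H` would of course find the minimax point directly; the sketch is meant to show how the candidate set `V` is assembled from sub-problems that never involve more than two cost matrices.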



IV. Experiments 

In the experiments, we compared three frameworks for
solving the minimax problem. The first builds the minimum
total cost classifier for each possible cost matrix without
considering any tradeoff among cost matrices, and then
picks the minimax classifier among them; the second is our framework
described above; and the third builds the minimax
classifier directly, with all the possible cost matrices
considered simultaneously. For brevity, we denote these
three frameworks as S, SP, and M respectively.
TABLE I

Summary of the 10 datasets used in the empirical studies

Dataset        No. of Features   Class Distribution
australian     14                307:383
crx            15                307:383
german         24                700:300
heart          13                150:120
hill-valley    100               612:600
house-votes    16                168:267
kr-vs-kp       36                1669:1527
mushroom       22                3916:4208
sonar          60                97:111
wdbc           30                357:212

A. Implementation 

Although there are many standard cost-sensitive learning
methods that minimize the total cost for a single cost
matrix and can therefore be used to implement S and one part of SP, to
the best of our knowledge there is no particular method that
can be used to implement M or the other part of SP (i.e.,
minimizing the maximum total cost over two or more cost matrices). Hence,
for comparison purposes, we adopted a simplified form of
the Generalized Additive Model (GAM) to implement all
three frameworks. Therefore, the empirical studies presented
below are preliminary, and are only intended to serve as a
baseline for future study.

The GAM used to implement all the compared frameworks
is

$$F(x) = \operatorname{sign}\Big(\sum_{i=1}^{T} f_i(x)\Big), \qquad (5)$$

where $T$ is the number of iterations, and $f_i$ is a decision
stump whose output is 1 or $-1$. At each iteration, we
add one decision stump such that the current ensemble of
decision stumps $F_i$ improves over $F_{i-1}$ on
the predefined objective. This process repeats until the iteration
budget is exhausted or there is no improvement.

With this simple GAM procedure, we are able to implement
the three above-mentioned frameworks. That is, all necessary
building blocks can be generated by setting the "predefined
objective" to minimizing the total cost for a single cost matrix,
minimizing the maximum total cost over a pair of cost matrices, or
minimizing the maximum total cost over a set of cost matrices.
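
A minimal Python sketch of such a greedy stump-adding procedure is given below. The paper does not specify how candidate stumps are generated or how a zero sum is handled, so the choices here (thresholds at observed feature values, ties mapped to class 0, exhaustive candidate search) are assumptions made only for illustration, and all names are hypothetical.

```python
def stump_output(stump, x):
    """A decision stump (feature, threshold, polarity) outputs +1 or -1."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def predict(ensemble, x):
    """Sign of the summed stump outputs; assumption: a zero sum maps to class 0."""
    score = sum(stump_output(s, x) for s in ensemble)
    return 1 if score > 0 else 0


def minimax_objective(cost_matrices):
    """Objective: largest total cost over the given (c0, c1) cost matrices."""
    def objective(y_true, y_pred):
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        return max(fp * c0 + fn * c1 for c0, c1 in cost_matrices)
    return objective


def fit_gam(X, y, objective, T=50):
    """Greedily add the stump that most reduces the objective; stop after
    T iterations or when no candidate improves the current ensemble."""
    candidates = [(j, t, pol)
                  for j in range(len(X[0]))
                  for t in sorted({x[j] for x in X})
                  for pol in (1, -1)]
    ensemble = []
    best = objective(y, [predict(ensemble, x) for x in X])
    for _ in range(T):
        scored = [(objective(y, [predict(ensemble + [s], x) for x in X]), s)
                  for s in candidates]
        value, stump = min(scored, key=lambda vs: vs[0])
        if value >= best:
            break
        ensemble.append(stump)
        best = value
    return ensemble
```

Passing `minimax_objective([C])` with a single cost matrix gives a building block for S, a pair of matrices gives the pairwise sub-problem of SP, and the whole set gives M.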

B. Experimental Setup 

Ten datasets from the UCI machine learning repository
[19] were used in the experiments. Brief information about
these datasets is summarized in Table I. Most of these ten
classification problems are originally real-world cost-sensitive
problems; for example, the australian, crx, and german
problems are fraud detection problems, while the heart, mushroom
and wdbc problems are related to human health. For these
problems, the misclassification cost matrix is usually hard, if
not impossible, for practitioners to specify, so the experiments
on them are appropriate.

For each of the datasets, we compared the 3 frameworks
on 4 sets of cost matrices of different cardinalities, namely
sets of 3, 5, 10, and 20 cost matrices. The value of each element
of the matrices is randomly generated within [0, 10). In addition,
it is ensured in advance that there is no dominated cost matrix
in each set.

The iteration number in the GAM is set to 50, and 20 repetitions of
5-fold cross validation were employed to obtain
stable results. Hence, for each of the 10 x 4 x 3 configurations
of (dataset, set of cost matrices, compared method), there are
20 x 5 total cost values. Based on these values, we furthermore
conducted the Wilcoxon signed-rank test between the SP method and
the other two methods at a significance level of 5%.
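
One way to generate such cost sets is rejection sampling, sketched below in Python. The paper does not state how the absence of dominated matrices was ensured, so this procedure, the function name `random_cost_set`, and its parameters are assumptions made for illustration.

```python
import random


def random_cost_set(k, low=0.0, high=10.0, rng=None):
    """Draw k cost matrices (c0, c1) with entries in [low, high) such that
    no matrix in the set is dominated by another (rejection sampling)."""
    rng = rng or random.Random()
    chosen = []
    while len(chosen) < k:
        cand = (low + (high - low) * rng.random(),
                low + (high - low) * rng.random())
        # Reject a candidate that dominates, or is dominated by, a kept matrix.
        conflict = any(
            (cand[0] < c[0] and cand[1] < c[1]) or
            (c[0] < cand[0] and c[1] < cand[1])
            for c in chosen)
        if not conflict:
            chosen.append(cand)
    return chosen
```

Passing a seeded generator, e.g. `random_cost_set(5, rng=random.Random(0))`, makes the drawn set reproducible across runs.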

C. Results 

Table II and Table III present the comparisons over each
dataset on training and testing respectively. The value in each
cell is the average total cost over 20 repetitions of 5-fold cross
validation, and the best performance for each (dataset, cost set)
configuration is in boldface. Moreover, the results of the Wilcoxon
signed-rank test are denoted as superscripts on the values of the S and M
methods: a superscript of 1 indicates that the performance of SP is
significantly better than that of the corresponding method, -1 that it is
significantly worse, and no superscript means there is no
statistically significant difference between SP and the corresponding
method.

In summary, we can see that SP outperforms the other two
methods in almost all cases, and remains statistically comparable
in the few remaining cases. There is no case in which SP is statistically
worse (i.e., there is no -1 among the superscripts).

Of course, it is not surprising that SP never loses to S
in the experiments, since SP always checks a
superset of the classifiers checked by S. But these results at least
provide evidence that the S framework is not adequate for
uncertain cost problems. Moreover, a closer examination
of the results in each fold shows that the performance of S
and SP is sometimes identical, and that SP is better when it is not.
This is consistent with Corollary 1, since the classifier obtained
by S could be the optimal minimax classifier in theory.

On the other hand, the superiority of SP over M is more
interesting. Unlike the S framework, M searches the hypothesis
space with the true learning target directly (i.e., the minimax
target). Therefore, the most plausible explanation is that the
implementation of the M framework is not effective enough,
since ideally it could perform as well as the SP framework.
However, as with the similar problem encountered in multi-class
classification, designing algorithms that can handle
multiple tradeoffs simultaneously is never trivial.

In summary, although we implemented the three compared
frameworks with a preliminary and less effective model, the
results reported in this paper confirm the efficacy of Corollary
1. Once a specially designed method
that can solve the minimax problem over only two cost matrices
effectively is available, it will be possible to see the full advantage
of the SP framework.

TABLE II

Total costs of the three methods on each dataset with different number of cost matrices during training

Dataset     | 3 cost matrices               | 5 cost matrices               | 10 cost matrices              | 20 cost matrices
            | S         SP        M         | S         SP        M         | S         SP        M         | S         SP        M
australian  | 1440.49¹  1367.74   1466.33¹  | 1869.28¹  1587.75   2199.41¹  | 1916.21¹  1515.35   2363.94¹  | 2096.11¹  1439.85   2481.22¹
crx         | 282.02¹   272.43    937.48¹   | 1758.45¹  1533.96   1962.19¹  | 1898.56¹  1420.8    2187.72¹  | 2170.47¹  1463.45   2432.3¹
german      | 2625.29¹  2249.45   3060.18¹  | 2099.12¹  1998.53   3727.99¹  | 1895.13¹  1593.92   3733.35¹  | 1995.7¹   1752.62   4060.1¹
heart       | 43B.S8¹   344.39    605.44¹   | 413.54¹   345.72    619.3¹    | 302.15¹   214.9     667.01¹   | 333.18¹   278.55    768.89¹
hill-valley | 3109.49¹  2227.36   2228.87¹  | 2370.07¹  2070.78   2070.78   | 3095.08¹  2634.57   2634.97   | 3809.36¹  2870.9    2870.9
house-votes | 670.16¹   458.55    823.89¹   | 807.34¹   687.18    1001.34¹  | 772.51¹   449.74    1004.26¹  | 575.23¹   387.44    1087.35¹
kr-vs-kp    | 9539.25¹  8309.86   9976.34¹  | 7025.3¹   6495.6    9237.13¹  | 6409.04¹  4561.48   9372.19¹  | 7649.38¹  5423.56   11482.27¹
mushroom    | 22280.14¹ 14009.54  26392.35¹ | 14268.4¹  10101.95  17715.29¹ | 20820.21¹ 10103.79  23435.58¹ | 16941.87¹ 8202.44   24488.3¹
sonar       | 551.62¹   489.32    571.2¹    | 280.42¹   257.66    382.92¹   | 451.66¹   369.81    600.6¹    | 316.12¹   236.27    404.28¹
wdbc        | 585.49    549.76    1429.25¹  | 450.72¹   375.75    1277.75¹  | 573.19¹   331.63    1501.69¹  | 404.9¹    233.12    1439.93¹

TABLE III

Total costs of the three methods on each dataset with different number of cost matrices during testing

Dataset     | 3 cost matrices               | 5 cost matrices               | 10 cost matrices              | 20 cost matrices
            | S         SP        M         | S         SP        M         | S         SP        M         | S         SP        M
australian  | 363.1¹    350.57    374.96¹   | 471.13¹   403.58    550.78¹   | 483.6¹    390.21    595.37¹   | 526.1¹    373.01    623¹
crx         | 71.19     70.74     235.8¹    | 440.41¹   386.28    493.3¹    | 476.17¹   362.7     547.51¹   | 542.86¹   375.82    611.52¹
german      | 664.42¹   567.29    763.08¹   | 531.22¹   512.17    936.35¹   | 480.25¹   411.74    937.4¹    | 506.24¹   451.19    1017.17¹
heart       | 116.32¹   97.64     155.65¹   | 111.4B¹   97.22     155.54¹   | 82.09¹    64.5      169.22¹   | 91.II¹    83.71     194.97¹
hill-valley | 778.84¹   578.5     578.91    | 604.4¹    536.96    536.96    | 789.11¹   684.93    685.41    | 984.32¹   748.74    748.74
house-votes | 171.27¹   118.62    206.69¹   | 204.17¹   176.01    250.3¹    | 197.44¹   122.75    254.13¹   | 149.98¹   104.16    272.02¹
kr-vs-kp    | 2380.23   2081.5    2495.92¹  | 1757.88¹  1624.11   2318.33¹  | 1609.81¹  1147.79   2344.12¹  | 1905.09   1363.94   2869.24¹
mushroom    | 5578.79¹  3506.46   6603.53¹  | 3569.52¹  2546.44   4430.6¹   | 5178.54¹  2520.05   5872.11¹  | 4269.31   2040.03   6114.33¹
sonar       | 142.89¹   130.41    147.37¹   | 72.75     68.58     97.75¹    | 119.fi¹   103.21    153.96¹   | 84.83¹    75.64     103.28¹
wdbc        | 150.34    142.44    363.01¹   | 116.01    103.1     320.44¹   | 151.82¹   95.9      377.93¹   | 109.95¹   68.94     360.53¹

V. Conclusions and Discussions

For many real-world cost-sensitive learning problems, the
costs associated with misclassifications are uncertain in nature.
Many existing cost-sensitive learning algorithms, which
require exact cost information (e.g., a unique cost matrix)
to be available, are not applicable to these problems. In this
paper, we consider the situation where the cost information
is provided as a set of cost matrices, and aim to obtain the
minimax classifier over the cost matrices. It is theoretically
proved that the classifier with minimax total cost is either
the optimal classifier for a single cost matrix in the set, or
the minimax classifier over a pair of cost matrices in the set.
This result immediately leads to a framework for obtaining a
minimax classifier over an arbitrary number of cost matrices.
Furthermore, it is also applicable when the cost information
is provided as an infinite set, e.g., intervals, by combining it with
an appropriate sampling/discretization procedure. A preliminary
empirical study has supported the efficacy of the framework.

Although many algorithms exist for standard cost-sensitive
learning problems, obtaining a minimax classifier over
a pair of cost matrices remains the major technical obstacle.
Therefore, novel algorithms should be developed for this
purpose to exploit the usefulness of our framework to the full
extent. Furthermore, the theoretical analysis conducted in this
work needs to be extended to multi-class problems so that the
resulting framework can be generalized. These issues will be
investigated in the future.